Airbnb Data Analytics

Using the RSQLite database setup in the previous post to answer several related data analysis questions

Nils Dosaj Mikkelsen https://example.com/norajones
2022-04-21

Introduction

In this post, we demonstrate how to answer a number of questions related to room listing type and host info using our newly created RSQLite database.

Questions

We answer the following 11 questions:

Room Listing

  1. What is the most common room type available?
  2. What is the top and bottom 10 property types based on average price?
  3. What is the top and bottom 10 property types based on review score?
  4. What is the most common amenities provided?
  5. Is there any correlation between room price and the review score?
  6. Room listing geographical distribution

Host

  1. Who are the top 10 host based on revenue?
  2. Is there any difference in review score between superhost and normal host?
  3. Is there any difference in response rate between superhost and normal host?
  4. What is the most commonly verified host information?
  5. How has the number of hosts joining airbnb increased/decreased overtime?

Libraries, Database Connection and Functions

Libraries

First, we must load the necessary libraries required for our data analysis and connect to our database.

Database Connection

Next, we connect to our database that we set up in the previous post

con <- dbConnect(RSQLite::SQLite(), "../airdb.SQLite")

The build_airbnb_database() Function

If you have not yet set up the require database, you can use the build_airbnb_database() function to do so. Note that this function leverages the remove_old_database() and insert_to_sql() functions defined in the previous post.

Show code
build_airbnb_database <- function(con, listing_data, remove_old_database = FALSE){
    
    #################### Remove Existing database
    if(remove_old_database == TRUE){
        remove_live_database(con)
    }

    #################### Deal with NA values 
    listing_data <- 
        listing_data %>% 
        
        # Convert dates to characters for NA values
        mutate(last_scraped = as.character(last_scraped),
               host_since = as.character(host_since),
               calendar_last_scraped = as.character(calendar_last_scraped),
               first_review = as.character(first_review),
               last_review = as.character(last_review),
               ) %>% 
        
        # Homogenize NA values
        #*# Taken from: https://rpubs.com/Argaadya/create_table_sql
        mutate_all(function(x) ifelse(x == "" | x == "None" | x == "N/A", NA, x)) %>%  #*#
        # mutate_all(function(x) ifelse(is.na(x), "NULL", x)) %>% 
        
        # Convert character strings back to date type
        mutate(last_scraped = as.Date(last_scraped),
               host_since = as.Date(host_since),
               calendar_last_scraped = as.Date(calendar_last_scraped),
               first_review = as.Date(first_review),
               last_review = as.Date(last_review)) 
    
    
    #################### Extract host data
    host_data <- listing_data %>% 
        select(host_id:host_identity_verified, 
               calculated_host_listings_count:calculated_host_listings_count_shared_rooms)
    
    
    #################### Remove duplicate values
    host_data <- host_data %>% distinct()
    
    
    #################### Convert dates
    # Note that this will need to converted back to type = date for analysis
    host_data  <- host_data %>% mutate(host_since = as.character(host_since)) 
    
    
    #################### Clean host verification column
    host_data <- 
        host_data %>% 
        mutate(host_verifications = str_remove_all(host_verifications, "[\\'\\[\\]]"))
    
    
    #################### Create table for host info
    query <- "CREATE TABLE host_info(
        host_id INT, 
        host_url VARCHAR(50), 
        host_name VARCHAR(100), 
        host_since VARCHAR(50),
        host_location VARCHAR(500), 
        host_about VARCHAR(10000),
        host_response_time VARCHAR(50),
        host_response_rate VARCHAR(50),
        host_acceptance_rate VARCHAR(50),
        host_is_superhost BOOLEAN,
        host_thumbnail_url VARCHAR(500),
        host_picture_url VARCHAR(500),
        host_neighbourhood VARCHAR(50),
        host_listings_count INT,
        host_total_listings_count INT,
        host_verifications VARCHAR(500),
        host_has_profile_pic BOOLEAN,
        host_identity_verified BOOLEAN,
        calculated_host_listings_count INT, 
        calculated_host_listings_count_entire_homes INT,
        calculated_host_listings_count_private_rooms INT,
        calculated_host_listings_count_shared_rooms INT,
        PRIMARY KEY(host_id)
        )"
    
    
    #################### Load host_info table
    dbSendQuery(con, query)
    
    
    #################### Check schema
    res <- dbSendQuery(con, "PRAGMA table_info([host_info]);")
    fetch(res)  
    dbClearResult(res)
    
    
    #################### Insert data into host_info table
    insert_to_sql(con, "host_info", host_data)
    
    
    ####################Listing table Processing####################
    
    # listing_data %>% view() 
    listing_data %>% glimpse()
    
    #################### Remove host_data columns
    listing_data <- listing_data %>% 
        select( - names(host_data)[-1])
    
    
    #################### Remove extraneous columns 
    listing_data <- listing_data %>% 
        select(-c(license, calendar_updated, bathrooms, scrape_id))
    
    
    #################### Remove dollar signs from price column
    listing_data <- listing_data %>% 
        mutate(price = str_remove_all(price, "[$,]") %>% 
                   as.numeric()
        )
    
    
    #################### Transform amenities and host verification column
    listing_data <- listing_data %>% 
        mutate(amenities = str_remove_all(amenities, "[\"\\'\\[\\]]"))
    
    
    listing_data %>% glimpse()
    
    #################### Convert dates to character
    listing_data <- 
        listing_data %>% 
        mutate(last_scraped          = as.character(last_scraped), 
               calendar_last_scraped = as.character(calendar_last_scraped),
               first_review          = as.character(first_review),
               last_review           = as.character(last_review))
    
    
    #################### Create listing table
    query_2 <- [1856 chars quoted with '"']
    
    
    #################### Insert listing table into database
    dbSendQuery(con, query_2)
    
    
    #################### Insert data into listing table
    insert_to_sql(con, "listing", listing_data)
    
    
    #################### Extract tables from database

}

Data Loading

We can load the data from our database in either of the following ways:

  1. As data frames
host_info <- tbl(con, "host_info") %>% as.data.frame()
listing <- tbl(con, "listing") %>% as.data.frame()
  1. By querying the database
# load host_info table 
res_host_info <- dbSendQuery(con, "select * from host_info")  
host_info <- fetch(res_host_info) 
dbClearResult(res_host_info)

# load listing table
res_listing <- dbSendQuery(con, "select * from listing")
listing <- fetch(res_listing)
dbClearResult(res_listing)

Room Listing Queries

Most Common Room Type Available

What is the most common room type available?

To find the most common room type available, we start by selecting room_type and has_availability, the latter being a logical indicator of unit availability. Then we filter the rooms that have availability, group them by room_type and rank them by availability.

q1<- listing%>%
    select(room_type,has_availability)%>%
    group_by(room_type)%>%
    filter(has_availability==1)%>%
    summarise(availability=n())%>%
    arrange(desc(availability))

#Using kable for table format
most_common_room_type_available <-q1%>% 
    knitr::kable(align = c("l", "c"),
                 format.args = list(big.mark = ","),
                 digits = 2)

most_common_room_type_available
room_type availability
Entire home/apt 9,314
Private room 6,163
Hotel room 873
Shared room 658

For a clearer visual representation of room type availability, we include the following bar graph of the same data.

q1_plot<-q1%>%
    ggplot(aes(x = availability, y = room_type %>% reorder(availability))) + 
    geom_col(fill = "Aquamarine4") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Most Common Room Type Available",
        x        = "Availability",
        y        = "Room Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))

q1_plot

As we see, Entire home/apt is the room type with the most available units overall, but that is also affected by the amount of Entire home/apt that are listed on Airbnb. When we take a look at the data there is also information about availability over periods of 30, 60, 90 and 365 days. This allows us to understand the average amount of nights available per room type across the different selection periods.

#Create list columns to retrieve data

availability_periods <- c('availability_30', 'availability_60', 
                          'availability_90', 'availability_365')

#Looping the results per period
for(col in availability_periods){
    tables<- paste('q1',col, sep='_')
    assign(tables,listing%>%
               select(room_type,col)%>%
               group_by(room_type)%>%
               summarise(mean=mean(.data[[col]])))
}


####################LOOP THE PLOTS


q1_availability_30_plot<-q1_availability_30%>%
    ggplot(aes(x = mean, y = room_type )) + 
    geom_col(fill = "Aquamarine4") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Most Common 30 Days",
        x        = "Availability",
        y        = "Room Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))


q1_availability_60_plot<-q1_availability_60%>%
    ggplot(aes(x = mean, y = room_type )) + 
    geom_col(fill = "Aquamarine4") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Most Common 60 Days",
        x        = "Availability",
        y        = "Room Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))


q1_availability_90_plot<-q1_availability_90%>%
    ggplot(aes(x = mean, y = room_type )) + 
    geom_col(fill = "Aquamarine4") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Most Common 90 Days",
        x        = "Availability",
        y        = "Room Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))



q1_availability_365_plot<-q1_availability_365%>%
    ggplot(aes(x = mean, y = room_type )) + 
    geom_col(fill = "Aquamarine4") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Most Common 365 Days",
        x        = "Availability",
        y        = "Room Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))


 
grid.arrange(q1_availability_30_plot, 
             q1_availability_60_plot, 
             q1_availability_90_plot,
             q1_availability_365_plot, ncol = 2)

We now see that Entire home/apt is the room type with the least average availability, combined with the fact that Entire home/apt is the room type with the highest units available we can conclude that this is also the room type with the highest demand.

Top/Bottom Property Types Average Price

Which are top 10 most expensive and top 10 cheapest property type?

First we start by selecting property_type and price from all available data, then we group by property type and summarise by average price. Then, we create two separate rankings:

  1. top_property_type_average_price
  2. bottom_property_type_average_price
q2<- listing%>%
    select(property_type,price)%>%
    group_by(property_type)%>%
    summarise(price = mean(price))%>%
    arrange(desc(price))


top_property_type_average_price <-q2 %>%
    arrange(desc(price))%>%
    top_n(10)%>% 
    knitr::kable(align = c("l", "c"),
    format.args = list(big.mark = ","),
    digits = 2)

top_property_type_average_price
property_type price
Entire vacation home 18,715.33
Shared room in casa particular 13,421.00
Entire villa 13,192.22
Private room in boat 12,000.00
Dome house 10,778.00
Barn 8,789.00
Shared room in serviced apartment 8,511.67
Shared room in residential home 7,592.49
Castle 7,500.00
Farm stay 6,195.12
bottom_property_type_average_price <-q2 %>%
    arrange(price)%>%
    top_n(-10)%>% 
    knitr::kable(align = c("l", "c"),
                 format.args = list(big.mark = ","),
                 digits = 2)

bottom_property_type_average_price
property_type price
Shared room in cave 320.00
Shared room in cabin 333.00
Shared room in dorm 353.00
Shared room in chalet 374.00
Private room in farm stay 399.00
Shared room in bed and breakfast 471.33
Private room in earth house 493.50
Shared room in villa 500.00
Shared room in townhouse 504.73
Shared room in tiny house 534.33

First we plot the top properties based on average price.

top_q2<- q2%>%
    arrange(desc(price))%>%
    top_n(10)%>% 
    ggplot(aes(x = price, y = property_type %>% reorder(price))) + 
    geom_col(fill = "Aquamarine4") + 
   
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Top Property Type by Average price",
        x        = "Average price",
        y        = "Property Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))

top_q2

Finally, we view the bottom properties based on average price.

bottom_q2<- q2%>%
    arrange((price))%>%
    top_n(-10)%>% 
    ggplot(aes(x = price, y = property_type %>% reorder(price))) + 
    geom_col(fill = "Skyblue3") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Bottom Property Type by Average price",
        x        = "Average price",
        y        = "Property Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold")) 

bottom_q2

The obvious trend in both of the above plots is that more room is more desirable at the same price.

Top/Bottom Room Types Review Score

Which are the best 10 and worst 10 reviewed room types?

For this query, we used the same logic as in the previous section but instead of selecting property_type and price we select property_type and review_scores_rating.

q3<- listing%>%
    select(property_type,review_scores_rating)%>%
    group_by(property_type)%>%
    summarise(review_scores_rating = mean(review_scores_rating))%>%
    arrange(desc(review_scores_rating))


top_property_type_review_scores_rating <-q3 %>%
    arrange(desc(review_scores_rating))%>%
    top_n(10)%>% 
    knitr::kable(align = c("l", "c"),
                 format.args = list(big.mark = ","),
                 digits = 2)

top_property_type_review_scores_rating
property_type review_scores_rating
Barn 5.00
Entire dorm 5.00
Entire hostel 5.00
Pension 5.00
Private room in farm stay 5.00
Shared room in barn 5.00
Entire cottage 4.92
Private room in dorm 4.75
Room in resort 4.62
Entire chalet 4.58
bottom_property_type_review_scores_rating <-q3 %>%
    arrange(review_scores_rating)%>%
    top_n(-10)%>% 
    knitr::kable(align = c("l", "c"),
                 format.args = list(big.mark = ","),
                 digits = 2)

bottom_property_type_review_scores_rating
property_type review_scores_rating
Shared room in cave 4.25
Entire bed and breakfast 4.30
Shared room in villa 4.50
Entire chalet 4.58
Room in resort 4.62
Private room in dorm 4.75
Entire cottage 4.92
Barn 5.00
Entire dorm 5.00
Entire hostel 5.00
Pension 5.00
Private room in farm stay 5.00
Shared room in barn 5.00

Next, we plot the of the top 10 property types based on review score.

top_q3<- q3%>%
    arrange(desc(review_scores_rating))%>%
    top_n(10)%>% 
    ggplot(aes(x = review_scores_rating, y = property_type %>% reorder(review_scores_rating))) + 
    geom_col(fill = "Aquamarine4") + 
    
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Top Property Type by Review Score Rating",
        x        = "Review Score Rating",
        y        = "Property Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))

top_q3

Finally, we plot the of the bottom 10 property types based on review score.

bottom_q3<- q3%>%
    arrange((review_scores_rating))%>%
    top_n(-10)%>% 
    ggplot(aes(x = review_scores_rating, y = property_type %>% reorder(review_scores_rating))) + 
    geom_col(fill = "Skyblue3") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Bottom Property Type by Review Score Rating",
        x        = "Review Score Rating",
        y        = "Property Type"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold")) 

bottom_q3

Most Common Amenities

Which are the most common amenities offered on the properties?

By looking at the amenities information in the listing data we realize that all amenities are listed in one column. In order to quantify and group the amenities we first get the max amount of amenities a property can have using the str_count() function of the stringr library.

ncols_q4 <- max(stringr::str_count(listing$amenities, ","))

Then we create multiple columns according to the max amount of amenities a property can have and populate the columns with all the amenities a property can have. We do this by using separatefunction from the tidyrlibrary.

#Create list of column names
colmn_q4 <- paste("col", 1:ncols_q4)

#Create columns filled with the amenities all properties have
q4<- listing%>%
    select(amenities)%>%
    tidyr::separate(
    col = amenities, 
    sep= ",", 
    into=colmn_q4,
    remove = FALSE)

By using pivot_longer, we prepare our data into data base format allows the occurrence of amenity types to be counted.

q4<- pivot_longer(data=q4,
                  cols = 'col 1':'col 77',
                  names_to = "col_number",
                  values_to = "separated_amenities")

Next, we tabulate all occurrences of the available amenity types.

q4<- q4%>%
    select(separated_amenities)%>%
    group_by(separated_amenities)%>%
    summarise(amenities_count=n())%>%
    na.omit()%>%
    arrange(desc(amenities_count))%>%
    top_n (10)

 
 q4_plot<-q4%>%
    ggplot(aes(x = amenities_count, y = separated_amenities %>% reorder(amenities_count))) + 
    geom_col(fill = "Aquamarine4") + 
    
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Most Common Amenities",
        x        = "Count",
        y        = "Amenities"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))

most_common_amenities<- q4%>%
    knitr::kable(align = c("l", "c"),
                 format.args = list(big.mark = ","),
                 digits = 2)

most_common_amenities
separated_amenities amenities_count
Air conditioning 16,289
Long term stays allowed 16,004
Essentials 15,027
Hangers 13,309
Shampoo 13,293
Hair dryer 11,946
Wifi 11,678
Kitchen 11,542
Washer 11,484
Dedicated workspace 11,047

Room Price Review Score Correlation

Does paying a higher price for ‘nicer’ accommodations correlate to an increased satisfaction level?

We notice that while an accommodation may only have one score in terms of price, it is reviewed among several different dimensions:

  1. Cleanliness
  2. Checkin
  3. Communication
  4. Location
  5. Rating
  6. Accuracy
  7. Value
  8. Overall Rating

We start by selecting the price and review related columns from the listing table, drop any rows that are missing review score, and filter out a spurious outlier. This leaves us with roughly 10,000 observations remaining, more than enough to theoretically examine correlation.

q5 <- 
    listing %>% select(price, review_scores_accuracy, 
                       review_scores_cleanliness, review_scores_checkin, 
                       review_scores_communication, review_scores_location, 
                       review_scores_rating, review_scores_value) %>% 
                drop_na() %>% 
                filter(price != 13)

We defined the following function c_plot() to handle the repetitive plotting of price versus our eight different review dimensions.

# Function defining correlation plot
c_plot <- function(df, y_val, y_name, clr = "dodgerblue4"){
    c_plot <- df %>% 
        ggplot(aes(x = price,
                   y = y_val)) +
            geom_jitter(color = clr, alpha = 0.5) + 
            scale_x_log10(label = scales::number_format(big.mark = ",")) +
            labs(x = "Price",
                 y = y_name,
                 title = y_name) + theme_tq()
    return(c_plot)
}

We then construct our correlation graphs using the above defined c_plot() function. We also use the grid.arrange() function from the gridExtra library to help align our multiple plots for parallel examination.

# Colours for correlation plot
c <- c("Aquamarine4", "Sienna3")

# Build correlation plots
q5_1 <- c_plot(q5, q5$review_scores_rating, "Rating vs Price") 
q5_2 <- c_plot(q5, q5$review_scores_accuracy, "Accuracy", clr = c[1]) 
q5_3 <- c_plot(q5, q5$review_scores_cleanliness, "Cleanliness", clr = c[1]) 
q5_4 <- c_plot(q5, q5$review_scores_checkin, "Check-in", clr = c[1]) 
q5_5 <- c_plot(q5, q5$review_scores_communication, "Communication", clr = c[2]) 
q5_6 <- c_plot(q5, q5$review_scores_location, "Location", clr = c[2]) 
q5_7 <- c_plot(q5, q5$review_scores_value, "Value", clr = c[2]) 

# Output correlation plots
q5_1
grid.arrange(q5_2, q5_3, q5_4, ncol = 3)
grid.arrange(q5_5, q5_6, q5_7, ncol = 3)

Our plots appear noisy, but the trend seems to be that properties with a lower price have a tendency to accrue more lower ratings along every dimension then properties with a higher rental price. We do see plenty of properties at the lower price range however, that have excellent review scores. Unfortunately, due to the asymmetric, non-normal distributed nature of the data, we are unable to apply the cor.test() function to determine if the correlation between price and review rating is statistically significant.

Room Listing Geographical Distribution

How are rental properties distributed by geographical location?

To see the geographical distribution of available accommodations to rent, we use the leaflet library to create an interactive map

q6 <- listing %>%
    left_join(host_info, by = "host_id") %>%
    select(host_id, host_name, listing_url, latitude, longitude, price,
           review_scores_rating, number_of_reviews, neighbourhood_cleansed) %>%
    replace_na(list(name = "No Name", host_name = "No Host Name"))


popup <- paste0("<b>", q6$name, "</b><br>",
                "Listing ID: ", q6$id, "<br>",
                "Host Name: ", q6$host_name, "<br>",
                "Price: ", q6$price, "<br>",
                "Review Scores Rating: ", ifelse(is.na(q6$review_scores_rating), 
                "No Review Yet", q6$review_scores_rating) , "<br>",
                "Number of Reviews: ", q6$number_of_reviews, "<br>",
                "<a href=", q6$listing_url, "> Click for more info</a>"
                )
leaflet(data = q6) %>% 
    addTiles() %>% 
    addMarkers(lng = ~longitude,
               lat = ~latitude, 
               popup = popup, 
               clusterOptions = markerClusterOptions())

Host Queries

Top hosts

Who are the top 10 hosts based on revenue?

We begin by joining together the two tables on the column host_id. We then select the necessary columns and create a new column called total_earnings which consists of the formula:

total_earnings = price * review_scores_rating * minimum_nights

We then remove the columns containing NA values and perform a count after grouping by the attributes host_id and host_name. At the same time, we calculate the average price and then finally, select the columns we want and arrange in descending order by the total_earnings.

q7 <- listing %>% 
    left_join(host_info, by = "host_id") %>% 
    select(host_id, host_name, price, 
           review_scores_rating, minimum_nights, number_of_reviews) %>%
    mutate(total_earnings = price * review_scores_rating * minimum_nights) %>% 
    drop_na() %>% 
    group_by(host_id, host_name) %>% 
    mutate(number_of_listing = n(),
           average_price = mean(price)) %>% 
    ungroup() %>% 
    select(host_id, host_name, total_earnings, number_of_listing, average_price) %>% 
    arrange(desc(total_earnings))

We create two plots instead of just one to examine the top posts by revenue. The first plot examines the top 10 hosts by the number of listings they have. The second plot, depicts the top 10 hosts by their total earnings.

We plot the results using a similar process for both plots with the main difference being that the y-axis for the top_host_by_listing plot is ordered by the number_of_listing column, While the top_host_by_earning is ordered by the total_earnings column

top_host_by_listing <- 
    q7 %>% 
    arrange(desc(number_of_listing)) %>% 
    select(host_name, number_of_listing) %>% 
    distinct() %>% 
    head(15) %>% 
    ggplot(aes(x = number_of_listing, y = host_name %>% reorder(number_of_listing))) + 
    geom_col(fill = "Skyblue3") +
    labs(
        title    = "Top Host by # of Listings",
        x        = "Number of Listing",
        y        = "Host Name"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))

top_host_by_earning <- 
    q7 %>% 
    select(host_name, total_earnings) %>%
    arrange(desc(total_earnings)) %>% 
    filter(total_earnings != 16242500) %>% 
    head(15) %>% 
    ggplot(aes(x = total_earnings, y = host_name %>% reorder(total_earnings))) + 
    geom_col(fill = "Aquamarine4") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Top Host by Total Earning",
        x        = "Total Eearning (in Baht)",
        y        = "Host Name"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(angle = 45, face = "bold", 
                                     vjust = 0.85, hjust = 0.89),
          axis.text.y = element_text(face = "bold"))

We once again use the grid.arrange() function from the GridExtra library to view the two plots side-by-side to aid in direct comparison.

grid.arrange(top_host_by_listing, top_host_by_earning, ncol = 2)

We notice that the host Bee is the only one who appears in both Top 10 lists.

Superhost vs Host Review Score

Is there any difference in review score between superhost and normal host?

Certain hosts receive the designation of superhost which can be achieved by meeting the following criteria:

  1. 12 months experience as a host
  2. Complete a minimum of 100 nights booked
  3. Response rate 90% or higher
  4. The consistent overall rating of 4.8 or higher
  5. A cancellation rate less than 1%
  6. At least 80% of all reviews are 5-star

In order to determine the difference in review score between superhosts and regular hosts, we use the mutate() function to create a new logical column host_is_superhost.

q8 <- listing %>% 
    left_join(host_info, by = "host_id") %>% 
    select(host_id, host_name, review_scores_rating, host_is_superhost) %>% 
    drop_na() %>% 
    mutate(host_is_superhost = as.logical(host_is_superhost)) %>% 
    select(review_scores_rating, host_is_superhost)

We then create two separate boxplots after isolating only the observations that match the respective TRUE/FALSE condition for the host_is_superhost column.

q8_1 <- 
    q8[q8$host_is_superhost == FALSE, ] %>% 
    ggplot(aes(y = review_scores_rating, group = host_is_superhost)) + 
    geom_boxplot(fill = "Skyblue3") + 
    labs(
        title    = "Host Ratings",
        subtitle = "Ratings Distribution",
        x        = "Host",
        y        = "Rating"
    ) + theme_tq()  + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold")) 

q8_2 <- 
    q8[q8$host_is_superhost == TRUE, ] %>% 
    ggplot(aes(y = review_scores_rating, group = host_is_superhost)) + 
    geom_boxplot(fill = "Aquamarine3") + 
    labs(
        title    = "Superhost Ratings",
        subtitle = "Ratings Distribution",
        x        = "Superhost",
        y        = "Rating"
    ) + theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))

Finally, we once again use the grid.arrange() function from the GridExtra library to view the two plots side-by-side to aid in direct comparison.

grid.arrange(q8_1, q8_2, ncol = 2)

There does appear to be a difference in response rate, especially in terms of variance, between superhosts and regular hosts based off visual inspection with the average superhost rating also ranking slightly higher. Unfortunately, due to the non-normality of the data, we are unable to rely on a t.test() to verify statistically if our visual assumptions are correct.

Superhost vs Host Response Rate

Is there any difference in response rate between superhost and normal host?

We repeat the same process as the previous query to determine the difference in response rate

q9 <- listing %>% 
    left_join(host_info, by = "host_id") %>% 
    select(host_id, host_name, host_response_rate, host_acceptance_rate, host_is_superhost) %>%
    drop_na() %>% 
    mutate(host_is_superhost = as.logical(host_is_superhost), 
           # Transform acceptance rate and response rate
           host_response_rate = host_response_rate %>% 
               str_remove("[%]") %>% 
               as.numeric(),
           host_acceptance_rate = host_acceptance_rate %>% 
               str_remove("[%]") %>% 
               as.numeric()
    )
q9_1 <- 
    q9[q9$host_is_superhost == FALSE, ] %>% 
    ggplot(aes(y = host_response_rate, group = host_is_superhost)) + 
    geom_boxplot(fill = "Skyblue3") + 
    labs(
        title    = "Host Response Rate",
        subtitle = "Ratings Distribution",
        x        = "Host",
        y        = "Rating"
    ) + theme_tq()  + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold")) 

q9_2 <- 
    q9[q9$host_is_superhost == TRUE, ] %>% 
    ggplot(aes(y = host_response_rate, group = host_is_superhost)) + 
    geom_boxplot(fill = "Aquamarine3") + 
    labs(
        title    = "Superhost Response Rate",
        subtitle = "Ratings Distribution",
        x        = "Superhost",
        y        = "Rating"
    ) + theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))
grid.arrange(q9_1, q9_2, ncol = 2)

We see here that regular hosts have a noticeably lower first quartile for response rate than the superhosts. Although, the set of superhosts is not without its outliers raising the question if some of these hosts may soon lose their superhost status due to their lacklustre response rate.

What is the most commonly verified host information?

What is the most commonly verified host information?

Here we will answer this question applying same logic as we did on question #4 for most common amenities.

#Max amount of verified informations per host
ncols_q10 <- max(stringr::str_count(na.omit(host_info$host_verifications), ","))
#Create list of column names
colmn_q10 <- paste("col", 1:ncols_q10)


#Create columns filled with all the verified informations
q10<- host_info%>%
    select(host_verifications)%>%
    tidyr::separate(
        col = host_verifications, 
        sep= ",", 
        into=colmn_q10,
        remove = FALSE)
#Pivot longer to prepare data into Data Base format
q10<- pivot_longer(data=q10,
                  cols = 'col 1':'col 11',
                  names_to = "col_number",
                  values_to = "separated_host_verifications")
#Pipe to answer the question
q10<- q10%>%
    select(separated_host_verifications)%>%
    group_by(separated_host_verifications)%>%
    summarise(host_verifications_count=n())%>%
    na.omit()%>%
    arrange(desc(host_verifications_count))%>%
    top_n (10)



#ploting the answer for visual reference
q10_plot<-q10%>%
    ggplot(aes(x = host_verifications_count, 
               y = separated_host_verifications %>% 
                   reorder(host_verifications_count))) + 
    geom_col(fill = "Aquamarine4") + 
    scale_x_continuous(labels = scales::number_format(big.mark = ",")) + 
    labs(
        title    = "Most Common Verified Information",
        x        = "Count",
        y        = "Verified Information"
    ) + 
    theme_tq() + 
    theme(axis.text.x = element_text(face = "bold"),
          axis.text.y = element_text(face = "bold"))

q10_plot
q10 <-most_common_host_verifications<- q10%>%
    knitr::kable(align = c("l", "c"),
                 format.args = list(big.mark = ","),
                 digits = 2)

most_common_host_verifications
separated_host_verifications host_verifications_count
email 6,305
phone 6,276
government_id 3,677
offline_government_id 2,824
reviews 2,682
jumio 2,440
selfie 2,277
identity_manual 2,227
facebook 1,268
phone 1,082

Most Active Months for New Host Sign-Up

What months of the year have historically seen the most activity in terms of new hosts signing up for the service?

We begin by extracting and isolating the necessary data using the following steps:

  1. Join together the two tables.
  2. Select the relevant columns.
  3. Convert host_since_date to the type date using the as.Date() function.
  4. Separate the host_since column into three separate columns for Year, month and day respectively.
  5. Discard the unnecessary day column.
  6. Group by Year and Month and use the count() function to tabulate the results.
  7. Construct the necessary columns for the resulting plot and table in the next step
  8. Disregard any rows containing NA values
q11 <- 
    host_info %>%                                                     # 1
    left_join(listing, by = "host_id") %>% 
    select(host_id, host_since) %>%                                   # 2
    mutate(host_since_date = as.Date(host_since)) %>%                 # 3
    separate("host_since", c("Year", "Month", "Day"), sep = "-") %>%  # 4
    select(-Day) %>%                                                  # 5
    group_by(Year, Month) %>%                                         # 6
    count(Year, Month) %>% 
    ungroup() %>% 
    mutate(year_month = paste0(Year, "-", Month, "-", "01"),          # 7
           year_month_2 = paste0(Year, "-", Month), 
           joined = n) %>% 
    select(year_month, year_month_2, joined) %>% 
    mutate(year_month = as.Date(year_month)) %>% 
    drop_na()                                                         # 8

Next, we use the ggplot library’s geom_line() function to plot the data as a time series.

q11 %>%
    ggplot(aes(x = year_month, y = joined)) +
    geom_line(size = 1.2, colour = "Aquamarine4") + 
    scale_x_date(breaks = waiver(), date_breaks = "6 months") + theme_tq() + 
    theme(axis.text.x = element_text(angle = 45, face = "bold", vjust = 0.65),
          axis.text.y = element_text(face = "bold")) + 
    labs(
        title    = "Number of hosts joined",
        subtitle = "Shows the frequency rate at which new posts sign up for airbnb",
        caption  = "",
        x        = "Joined",
        y        = "Year/Month")

Since we are dealing with data over a number of years, it is helpful to also compile a list of the top 10 most active months in terms of new hosts

q11 %>% 
    select(-year_month) %>% 
    mutate(year_month = as.yearmon(year_month_2)) %>% 
    select(-year_month_2) %>% 
    select(year_month, joined) %>% 
    arrange(desc(joined)) %>% 
    head(10) %>% knitr::kable(align = c("c", "c"))
year_month joined
Jul 2018 405
Jun 2019 366
Jul 2015 358
Jul 2019 311
Aug 2015 300
Sep 2018 295
Apr 2016 284
Dec 2015 267
May 2017 261
Dec 2018 259

We see that the summer months (June-August), especially in recent years, makeup half of the busiest months in terms of new hosts joining the service. Interestingly, December also has two months in the top 10.

Database Disconnect

Finally, before exiting our program, disconnect from the database

Conclusion

We have demonstrated how to pull in real-world data, divide and clean the data into usable tables, insert the data into a database and then use that database to answer interesting questions that may help provide useful and actionable insights on which to base future decisions. R and its many libraries, specifically the tidyverse, provide a powerful framework with which to answer many interesting questions, often in only a few lines of code. We encourage you to come up with your own questions and see if you can answer them using the provided data.